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Abstract — We consider the problem of reconstructing vehicle 
trajectories from sparse sequences of GPS points, for which 
the sampling interval is between 10 seconds and 2 minutes. 
We introduce a new class of algorithms, called altogether path 
inference filter (PIF), that maps GPS data in real time, for a 
variety of trade-offs and scenarios, and with a high throughput. 
Numerous prior approaches in map-matching can be shown to be 
special cases of the path inference filter presented in this article. 
We present an efficient procedure for automatically training the 
filter on new data, with or without ground truth observations. The 
framework is evaluated on a large San Francisco taxi dataset and 
is shown to improve upon the current state of the art. This filter 
also provides insights about driving patterns of drivers. The path 
inference filter has been deployed at an industrial scale inside 
the Mobile Millennium traffic information system, and is used to 
map fleets of data in San Francisco, Sacramento, Stockholm and 
Porto. 

I. Introduction 

Amongst the modern man-made plagues, traffic congestion 
is a universally recognized challenge |11|. Building reliable 
and cost-effective traffic monitoring systems is a prerequisite 
to addressing this phenomenon. Historically, the estimation 
of traffic congestion has been limited to highways, and has 
relied mostly on a static, dedicated sensing infrastructure such 
as loop detectors or cameras f4T |. The estimation problem is 
more challenging in the case of the secondary road network, 
also called the arterial network, due to the cost of deploying 
a wide network of sensors in large metropolitan areas. The 
most promising source of data is the GPS receiver in personal 
smartphones and commercial fleet vehicles. According to some 
studies 1 32 1, devices with a data connection and a GPS 
will represent 80% of the cellphone market by 2015. GPS 
observations in cities are noisy flOl , and are usually provided 
at low sampling rates (on the order of one minute) | 9 |. One 
of the common problems which occurs when dealing with 
GPS traces is the correct mapping of these observations to 
the road network, and the reconstruction of the trajectory of 
the vehicle. We present a new class of algorithms, called the 
path inference filter, that solves this problem in a principled 
and efficient way. Specific instantiations of this algorithm have 
been deployed as part of the Mobile Millennium system, which 
is a traffic estimation and prediction system developed at the 
University of California [2J. Mobile Millennium infers real- 
time traffic conditions using GPS measurements from drivers 
running cell phone applications, taxicabs, and other mobile and 
static data sources. This system was initially deployed in the 
San Francisco Bay area and later expanded to other locations 
such as Sacramento, Stockholm, and Porto. 




Figure 1. An example of dataset available to Mobile Millennium and 
processed by the path inference filter: taxicabs in San Francisco from the 
Cabspotting program |9 |. Large circles in red show the position of the taxis 
at a given time and small dots (in black) show past positions (during the last 
five hours) of the fleet. The position of each vehicle is observed every minute. 

GPS receivers have enjoyed a widespread use in trans- 
portation and they are rapidly becoming a commodity. They 
offer unique capabilities for tracking fleets of vehicles (for 
companies), and routing and navigation (for individuals). 
These receivers are usually attached to a car or a truck, also 
called a probe vehicle, and they relay information to a base 
station using the data channels of cellphone networks (3G, 
4G). A typical datum provided by a probe vehicle includes an 
identifier of the vehicle, a (noisy) position and a timestampj^ 
Figure [T] graphically presents a subset of probe data collected 
by Mobile Millennium. In addition to these geolocalization 
attributes, data points contain other attributes such as heading, 
speed, etc. We will show how this additional information can 
be integrated in the rest of the framework presented in this 
article. 

The two most important characteristics of GPS data for 
traffic estimation purposes are the GPS localization accuracy 
and the sampling strategy followed by the probe vehicle. In 
order to reduce power consumption or transmission costs, 

^The experiments in this article use GPS observations only. However, 
nothing prevents the application of the algorithms presented in this article 
to other types of localized data. 
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probe vehicles do not continuously report their location to the 
base station. The probe data currently available are generated 
using a combination of the two following strategies: 

• Geographical sampling: GPS probes are programmed to 
send information in the vicinity of virtual landmarks 1231 . 
This concept was popularized by Nokia under the term 
Virtual Trip Line |[T9ll . These landmarks are usually laid 
over some predetermined route followed by drivers. 

• Temporal sampling: GPS probes send their position at 
fixed rate. The critical factor is then the temporal resolu- 
tion of the probe data. A low temporal resolution carries 
some uncertainty as to which trajectory was followed. 
A high temporal resolution gives access to the complete 
and precise trajectory of the vehicle. However, the de- 
vice usually consumes more power and communication 
bandwidth. 

In the case of a high temporal resolution (typically, a frequency 
greater than an observation per second), some highly success- 
ful methods have been developed for continuous estimation 
1381 , 1241 . |[T2l . However, most data collected at large scale 
today is generated by commercial fleet vehicles. It is primarily 
used for tracking the vehicles and usually has a low temporal 
resolution (1 to 2 minutes) 1271 . 1211 . 1351 . 191 . In the span of a 
minute, a vehicle in a city can cover several blocks. Therefore, 
information on the precise path followed by the vehicle is 
lost. Furthermore, due to GPS localization errors, recovering 
the location of a vehicle that just sent an observation is a non 
trivial task: there are usually several streets that could be com- 
patible with any given GPS observation. Simple deterministic 
algorithms to reconstruct trajectories fail due to misprojection 
(Figure |3]) or shortcuts (Figure [2]). Such shortcomings have 
motivated our search for a principled approach that jointly 
considers the mapping of observations to the network and the 
reconstruction of the trajectory. 

The problem of mapping data points onto a map can be 
traced back to 1980 |3|. Researchers started systematic studies 
after the introduction of the GPS system to civilian appli- 
cations in the 1990s 1(311 . These early approaches followed 
a geometric perspective, associating each observation datum 
to some point in the network |40|. Later, this projection 
technique was refined to use more information such as heading 
and road curvature. This greedy matching, however, leads to 
poor trajectory reconstruction since it does not consider the 
path leading up to a point 1421 . New deterministic algorithms 
emerged to directly match partial trajectories to the road by 
using the topology of the network L16J and topological metrics 
based on the Frechet distance O, 1391 . These deterministic 
algorithms cannot readily cope with ambiguous observations 
1241, and were soon expanded into probabilistic frameworks. 
A number of implementations were explored: particle filters 
1291 . ri7M . Kalman filters |28|, Hidden Markov Models 141, 
and less mainstream approaches based on Fuzzy Logic and 
Belief Theory. 

Two types of information are missing in a sequence of GPS 
readings: the exact location of the vehicle on the road network 
when the observation was emitted, and the path followed from 
the previous location to the new location. These problems 



are correlated. The aforementioned approaches focus on high- 
frequency sampling observations, for which the path followed 
is extremely short (less than a few hundred meters, with very 
few intersections). In this context, there is usually a dominant 
path that starts from a well-defined point, and Bayesian filters 
accurately reconstruct paths from observations 1281 . 1381 . ifTTl . 
When sampling rates are lower and observed points are further 
apart, however, a large number of paths are possible between 
two points. Researchers have recently focused on efficiently 
identifying these correct paths and have separated the joint 
problem of finding the paths and finding the projections into 
two distinct problems. The first problem is path identification 
and the second step is projection matching 1431 . B1 . B2l . ifTSl . 
|37|. Some interesting trajectories mixing points and paths that 
use a voting scheme have also recently been proposed |42l. 
Our filter aims at solving the two problems at the same time, 
by considering a single unified notion of trajectory. 

The path inference filter is a probabilistic framework that 
aims at recovering trajectories and road positions from low- 
frequency probe data in real time, and in a computationally 
efficient manner. As will be shown, the performance of the 
filter degrades gracefully as the sampling frequency decreases, 
and it can be tuned to different scenarios (such as real time 
estimation with limited computing power or offline, high 
accuracy estimation). 

The filter is justified from the Bayesian perspective of the 
noisy channel and falls into the general class of Conditional 
Random Fields 1221 . Our framework can be decomposed into 
the following steps: 

• Map matching: each GPS measurement from the input is 
projected onto a set of possible candidate states on the 
road network. 

• Path discovery: admissible paths are computed between 
pairs of candidate points on the road network. 

• Filtering: probabilities are assigned to the paths and the 
points using both a stochastic model for the vehicle 
dynamics and probabilistic driver preferences learned 
from data. 

According to the very exhaustive review by Quddus et al. 
1 30 1, most map-matching approaches fall into one of the four 
categories: 

1) "Geometric" methods, which pick the closest matching 
point. The distance metric itself is the subject of varia- 
tions by different authors. 

2) "Weighted topological" methods, which use connectivity 
information between links and various ways to weight 
the different paths. 

3) "Probabilistic" methods, which combine variance in- 
formation about the points and topological information 
about the paths in a simple way. 

4) "Advanced" methods, which encompass everything more 
complicated: Kalman Filtering, Particle Filtering, Belief 
Theory 1 13 | and Fuzzy Logic 1341 

^Note that "probabilistic" models, as well as most of the "advanced" models 
(Kalman Filtering, Particle Filtering, Hidden Markov Models) fall under the 
general umbrella of Dynamic Bayesian Filters, presented in great detail in 
|38|. As such, they deserve a common theoretical treatment, and in particular 
all suffer from the same pitfalls detailed in Section 
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The path inference filter presents a number of compelling 
advantages over the work found in the current literature: 

1) The approach presents a general framework grounded in 
established statistical theory that encompasses, as special 
cases, most techniques presented as "geometric", "topo- 
logical" or "probabilistic". In particular, it combines 
information about paths, points and network topology 
in a single unified notion of trajectory. 

2) Nearly all work on Map Matching is segmented into 
(and presents results for) either high-frequency or low- 
frequency sampling. The path inference filter performs 
as well as the current state-of-the-art approaches for 
sampling rates less than 30 seconds, and improves upon 
the state of the art (431, 1421 by a factor of more than 
10% for sampling intervals greater than 60 second^ We 
also analyze failure cases and we show that the output 
provided by the path inference filter is always "close" 
to the true output for some metric. 



3) 



4) 



As will be seen in Section III most existing approaches 
(which are based on Dynamic Bayesian Networks) do 
not work well at lower frequencies due to the selection 
bias problem. Our work directly addresses this problem 
by performing inference on a Random Field. 
The path inference filter can be used with complex 
path models such as those used in IH and ifTSl . In the 
present article, we restrict ourselves to a class of models 
(the exponential family distributions) that is rich enough 
to provide insight on the driving patterns of the vehi- 
cles. Furthermore, when using this class of models, the 
learning of new parameters leads to a convex problem 
formulation that is fast to solve. These parameters can 
be learned using standard Machine Learning algorithms, 
even when no ground truth is available. 
5) With careful engineering, it is possible to achieve high 
throughput on large-scale networks. Our reference im- 
plementation achieves an average throughput of hun- 
dreds of GPS observations per second on a single core 
in real time. Furthermore, the algorithm scales well on 
multiple cores and has achieved average throughput of 
several thousands of points per second on a multicore 
architecture. 

Algorithms often need to trade off accuracy for timeliness, and 
are considered either "local" (greedy) or "global" (accumulat- 
ing some number of points before returning an answer) 1421 . 
The path inference filter is designed to work across the full 
spectrum of accuracy versus latency. As we will show, we can 
still achieve good accuracy by delaying computations by only 
one or two time steps. 

II. Path discovery 

The road network is described as a directed graph M = 
(V, f ) in which the nodes are the street intersections and the 
edges are the streets, referred to in the text as the links of 
the road network. Each link is endowed with a number of 

^Performance comparisons are complicated by the lack of a single agreed- 
upon benchmark dataset. Nevertheless, the city we study is complex enough 
to compare favorably with cities studied with other works. 




Figure 2. Example of failure when using an intuitive algorithm projects 
each GPS measurement to the closest link. The raw GPS measurements are 
the triangles, the actual true trajectory is the dashed line, and the reconstructed 
trajectory is the continuous line. Due to noise in the observation, the point at 
t = 2 is closer to the orthogonal road and forces the algorithm to add a left 
turn, while the vehicle is actually going straight. This problem is frequently 
observed for GPS data in cities. The path inference filter provides one solution 
to this problem. 




Figure 3. Example of failure when trying to minimize the path length between 
a sequence of points. The raw observations are the triangles, the actual true 
trajectory is the dashed line, and the reconstructed trajectory is the continuous 
line. The circles are possible locations of the vehicle corresponding to the 
observations. The hashed circles are the states chosen by this reconstruction 
algorithm. Due to GPS errors that induce problems explained in Figure [2] we 
must consider point projections on all links within a certain distance from 
the observed GPS points. However, the path computed by a shortest path 
algorithm may not correspond to the true trajectory. Note how, for t = 2 and 
t = 3, the wrong link and the wrong states are elected to reconstruct the 
trajectory. 



physical attributes (speed limit, number of lanes, type of road, 
etc.). Given a link of the road network, the links into which a 
vehicle can travel will be called outgoing links, and the links 
from which it can come will be called the incoming links. 
Every location on the road network is completely specified 
by a given link / and offset o on this link. The offset is 
a non-negative real number bounded by the length of the 
corresponding link, and represents the position on the link. 
At any time, the state x of a vehicle consists of its location 
on the road network and some other optional information such 
as speed, or heading. For our example we consider that the 
state is simply the location on one of the road links (which 
are directed). Additional information such as speed, heading, 
lane, etc. can easily be incorporated into the state: 
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Furthermore, for the remainder of this article we consider 
trajectory inference for a single probe vehicle. 

A. From GPS points to discrete vehicle states 

The points are mapped to the road following a Bayesian 
formulation. Consider a GPS observation g. We study the 
problem of mapping it to the road network according to 
our knowledge of how this observation was generated. This 
generation process is represented by a probability distribution 
00 {g\x) that, given a state x, returns a probability distribution 
over all possible GPS observations g. Such distributions uj 
will be described in Section |III-D[ Additionally, we may 
have some prior knowledge over the state of the vehicle. For 
example, some links may be visited more often than others, 
and some positions on links may be more frequent, such as 
when vehicles accumulate at the intersections. This knowledge 
can be encoded in a prior distribution VL{x). Under this 
general setting, the state of a vehicle, given a GPS observation, 
can be computed using B ayes' rule: 

TT {x\g) (X UJ {g\x) Vt {x) 

The letter tt will define general probabilities, and their depen- 
dency on variables will always be included. This probability 
distribution is defined up to a scaling factor in order to 
integrate to 1. This posterior distribution is usually com- 
plicated, owing to the mixed nature of the state. The state 
space is the product of a discrete space over the links and a 
continuous space over the link offsets. Instead of representing 
it in closed form, some sampled values are considered: for 
each link a finite number of states from this link are 
elected to represent the posterior distribution of the states 
on this link it {o\g^l = li). A first way of accomplishing this 
task is to grid the state space of each link, as illustrated in 
Figure |4] This strategy is robust against the observation errors 
described in Section |II-B[ but it introduces a large number 
of states to consider. Furthermore, when new GPS values are 
observed every minute, the vehicle can move quite extensively 
between updates. The grid step is usually small compared to 
the distance traveled. Instead of defining a coarse grid over 
each link, another approach is to use some most likely state 
on each link. Since our state is the pair of a link and an offset 
on this link, this corresponds to selecting the most likely offset 
on each state: 



VL,, 



argmax tt (o|^, I = k) 



In practice, the probability distribution tt {x\g) decays 
rapidly, and can be considered overwhelmingly small beyond a 
certain distance from the observation g. Links located beyond 
a certain radius need not be considered valid projection links, 
and may be discarded. 

In the rest of this article, the boldface symbol x will denote a 
(finite) collection of states associated with a GPS observation g 
that we will use to represent the posterior distribution n {x\g), 
and the integer / will denote its cardinality: x = {xi)-^.j. 
These points are called candidate state projections for the 
GPS measurement g. These discrete points will then be linked 
together through trajectory information that takes into account 



t 



n{x\g) 



State-space gridding 




^post 

(Most likely posterior) 



(Most likely measure- 
ment prior) 



Figure 4. Example of a measurement g on a link and two strategies to 
associate state projections to that measurement on a particular link (gridding 
and most likely location). The GPS measurement is the triangle denoted g. 
For this particular measurement, the observation distribution uu (x\g) and the 
posterior distribution tt (x\g) are also represented. When gridding, we select 
a number of states xi, • • • xj spanning each link at regular intervals. This 
allows us to use the posterior distribution and have a more precise distribution 
over the location of the vehicle. However, it is more expensive to compute. 
Another strategy is to consider a single point at the most probable offset 
XpQgj according to the posterior distribution 7T(x\g). However, this location 
depends on the prior, which is usually not available at this stage (since the 
prior depends on the location of past and future points, for which do not also 
know the location). A simple approximation is to consider the most likely 
point a:*^^^ according to the observation distribution. 



the trajectory and the dynamics of the vehicle. We now 
mention a few important points for a practical implementation. 

The prior distribution. A Bayesian formulation requires 
that we endow the state x with a prior distribution Q{x) 
that expresses our knowledge about the distribution of points 
on a link. When no such information is available, since the 
offset is continuous and bounded on a segment, a natural 
non-informative prior is the uniform distribution over offsets: 
ft ^ U ([0, length (/)]). In this case, maximizing the posterior 
is equivalent to maximizing the conditional distribution from 
the generative model: 



* ' ^observation 



argmax^cj {g\x = {o.k)) 



Having mapped GPS points into discrete points on the road 
network, we now turn our attention to connecting these points 
by paths in order to form trajectories. 

B. From discrete vehicle states to trajectories 

At each time step t, a GPS point g^ (originating from a 
single vehicle) is observed. This GPS point is then mapped 
onto P different candidate states denoted = x\ - ■ ■ x\f 
Because this set of projections is finite, there is only a (small) 
finite number of paths that a vehicle can have taken while 




Figure 5. Example of path exploration between two observations. The true 
trajectory and two associated GPS observations are shown on the upper left 
corner. The upper right corner figure shows the set of candidate projections 
associated with each observation. A path discovery algorithm computes every 
acceptable path between between each pair of candidate projections. The four 
figures at the bottom show a few examples of such computed paths. 

moving from some state xj G to some state x-,^^ G x^+^. 
We denote the set of candidate paths between the observation 
and the next observation g^^^ by : 

Each path goes from one of the projection states x\ of g^ to 
a projection state x-,^^ of g^^^. There may be multiple pairs 
of states to consider, and between each pair of states, there 
are typically several paths available (see Figure [5]). Lastly, a 
trajectory is defined by the succession of states and paths, 
starting and ending with a state: 

r = X1P1X2 ■ --Vt-iXt 

where xi is one element of x^, pi of p^, and so on. 

Due to speed limits leading to lower bounds on achievable 
travel times on the network, there is only a finite number of 
paths a vehicle can take during a time interval At. Such paths 
can be computed using standard graph search algorithms. The 
depth of the search is bounded by the maximum distance 
a vehicle can travel on the network at a speed Vmax within 
the time interval between each observation. An algorithm that 
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Figure 6. Example of failure when observing strict physical consistency: due 
to the observation noise, the observation (3) appears physically behind (2) on 
the same link. Without considering backward paths, the most plausible expla- 
nation is that the vehicle performed a complete loop around the neighboring 
block. 

performs well in practice is the A* algorithm fTSl, a common 
graph search algorithm that makes use of a heuristic to guide 
its search. The cost metric we use here is the expected travel 
time on each link, and the heuristic is the shortest geographical 
distance, properly scaled so that it is an admissible heuristic. 

The case of backward paths. It is convenient and realistic 
to assume that a vehicle always drives /(9warJ, i.e. in the same 
direction of a linlQ In our notation, a vehicle enters a link 
at offset 0, drives along the link following a non-decreasing 
offset, and exits the link when the offset value reaches the 
total length of the link. However, due to GPS noise, the most 
likely state projection of a vehicle waiting at a red light may 
appear to go backward, as shown in Figure |6] This leads to 
incorrect transitions if we assume that paths only go forward 
on a link. Three approaches to solve this issue are discussed, 
depending on the application: 

1) It is possible to keep a single state for each link (the most 
likely) and explore some backward paths. These paths 
are assumed to go backward because of observation 
noise. This solution provides connected states at the 
expense of physical consistency, all the measurements 
are correctly mapped to their most likely location, but 
the trajectories themselves are not physically acceptable. 
This is useful for applications that do not require con- 
nectedness between pairs of states, for example when 
computing a distribution of the density of probe data 
per link. 

2) It is also possible to disallow backward paths and 
consider multiple states per link, such as a grid over 
the state space. A vehicle never goes backward, and 
in this case the filter can generally account for the 
vehicle not moving by associating the same state to suc- 
cessive observations. All the trajectories are physically 
consistent and the posterior state density is the same as 
the probability density of the most likely states, but is 

"^Reverse driving is in some cases even illegal. For example, the laws of 
Glendale, Arizona, prohibit reverse driving. 
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more burdensome from a computational perspective (the 
number of paths to consider grows quadratically with the 
number of states). 
3) Finally it is possible to disallow backward paths and use 
a sparse number of states. The path connectivity issue 
is solved using some heuristics. Our implementation 
creates a new state projection on a link / using the 
following approach: 

Given a new observation g, and its most likely state 
projection = (/,o*): 

a) If no projection for the link / was found at the 
previous time step, return x* 

b) If such a projection Xbefore = (^c>before) existed, 
return x = max (obefore^ o^)) 

With this heuristic, all the points will be well connected, 
but the density of the states will not be the same as the 
density of the most likely reconstructed states. 
In summary, the first solution is better for density estimations 
and the third approach works better for travel time estimations. 
The second option is currently only used for high-frequency 
offline filtering, for which paths are short, and for which more 
expensive computations is an acceptable cost. 

Handling errors Maps may contains some inaccuracies, 
and may not cover all the possible driving patterns. Two errors 
were found to have a serious effect on the performance of the 
filter: 

• Out of network driving: This usually occurs in parking 
lots or commercials driveways. 

• Topological errors: Some links may be missing on the 
base map, or one-way streets may have changed to two- 
way streets. These situations are handled by running flow 
analysis on the trajectory graph. For every new incoming 
GPS point, after computing the paths and states, it is 
checked if any candidate position of the first point of 
the trajectory is reachable from any reachable candidate 
position on the latest incoming point, or equivalently 
if the trajectory graph has a positive flow. The set of 
state projections of an observation may end up being 
disconnected from the start point even if at every step, 
there exists a set of paths between each points. In this 
situation, the probability model will return a probability 
of (non-physical trajectories) for any trajectory. If 
a point becomes unreachable from the start point, the 
trajectory is broken, and restarted again from this point. 
Trajectory breaks were few (less than a dozen for our 
dataset), and a visual inspection showed that the vehicle 
was not following the topology of the network and instead 
made U-turns or breached through one-way streets. 

III. Discrete filtering using a Conditional 
Random Field 

In the previous section, we reduced the trajectory recon- 
struction problem to a discrete selection problem between 
sets of candidate projection points, interleaved with sets of 
candidate paths. A probabilistic framework can now be applied 
to infer a reconstructed trajectory r or probability distributions 
over candidate candidate states and candidate paths. Without 



further assumptions, one would have to enumerate and com- 
pute probabilities for every possible trajectory. This is not 
possible for long sequences of observations, as the number 
of possible trajectories grows exponentially with the number 
of observations chained together. By assuming additional inde- 
pendence relations, we turn this intractable inference problem 
into a tractable one. 

A. Conditional Random Fields to weight trajectories 

The observation model provides the joint distribution of 
a state on the road network given an observation. We have 
described the noisy generative model for the observations in 



Section |II-A[ Assuming that the vehicle is at a point x, a 
GPS observation g will be observed according to a model uj 
that describes a noisy observation channel. The value of g 
only depends on the state of the vehicle, i.e. the model reads 
UJ {g\x). For every time step t, assuming that the vehicle is at 
the location x^, a GPS observation g^ is created according to 
the distribution u {g^\x^). 

Additionally, we endow the set of all possible paths on the 
road network with a probability distribution. The transition 
model T] describes the preference of a driver for a particular 
path. In probabilistic terms, it provides a distribution r] (p) 
defined over all possible paths p across the road network. This 
distribution is not a distribution over actually observed paths 
as much as a model of the preferences of the driver when 
given the choice between several options. 

We introduce the following Markov assumptions. 

• Given a start state Xstart and an end state Xend. the path 
p followed by the vehicle between these two points will 
only depend on the start state, the end state and the path 
itself. In particular, it will not depend on previous paths 
or future paths. 

• Consider a state x followed by a path pnext and preceded 
by a path ^previous, and associated to a GPS measurement 
g. Then the paths taken by the vehicle are independent 
from the GPS measurement g if the state x is known. 
In other words, the GPS measurement does not add 
subsequent information given the knowledge of the state 
of the vehicle. 

Since a state is composed of an offset and a link, a path is 
completely determined by a start state, an end state and a list 
of links in between. Conditional on the start state and end 
state, the number of paths between these points is finite (it is 
the number of link paths that join the start link and the end 
link). 

Not every path is compatible with given start point and end 
point: the path must start at the start state and must end at 
the end state. We formally express the compatibility between 
a state x and the start state of a path p with the compatibility 
function S: 



6{x,p) 



1 if the path p starts at point x 
otherwise 



Similarly, we introduce the compatibility function S to 
express the agreement between a state and the end state of 



Figure 7. Illustration of the Conditional Random Field defined over a 
trajectory r = x^p^x^p^x^ and a sequence of observations g^'^ . The 
gray nodes indicate the observed values. The solid lines indicate the factors 
between the variables: uo (yg^\x^) between a state x^ and an observation g^, 
6 (a:*,p*) T] (p*) between a state x^ and a path and 6 (^p^,x^~^^) between 
a path p^ and a subsequent state x^~^-^. 



Figure 8. A Dynamic Bayesian Network (DBN) commonly used to model 
the trajectory reconstruction problem. The arrows indicated the directed 
dependencies of the variables. The GPS observations g^ are generated from 
states X*. The unobserved paths p* are generated from a state x^, following 
a transition probability distribution it (p\x). The transition from a path p^ to 
a state x* follows the transition model tt {x\p). 



a path: 



5 (p, x) 



if the path p ends at point x 
otherwise 



Given a sequence of observations g^'- 
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x^p^ ■ 



• X 



- 9^ ' " 9^ and 
we define the 



an associated trajectory r 
unnormalized score, or potential, of the trajectory as: 

VT-l 



\A9 



1:T\ 



\{u:{g'\x')5_{x\p')ri{p')5{p\x 



t = l 



The non-negative function (j) is called the potential function. 
A trajectory r is said to be a compatible trajectory with the 
observation sequence g^'-^ if ^(t|^^-^) > 0. When properly 
scaled, the potential (/) defines a probability distribution over 
all possible trajectories, given a sequence of observations: 

The variable Z, called the partition function, is the sum of the 
potentials over all the compatible trajectories: 

T 

We have combined the observation model uj and the tran- 
sition model T] into a single potential function (j), which 
defines an unnormalized distribution over all trajectories. Such 
a probabilistic framework is called a Conditional Random 
Field (CRF) 1221 . A CRF is an undirected graphical model 
which is defined as the unnormalized product of factors over 
cliques of factors (see Figure [7]). There can be an exponentially 
large number of paths, so the partition function cannot be 
computed by simply summing the value of (j) over every 



possible trajectory. As will be seen in Section |IVj the value 
of Z needs to be computed only during the training phase. 
Furthermore it can be computed efficiently using dynamic 
programming. 

The case against the Hidden Markov Model approach. 

The classical approach to filtering in the context of trajectories 
is based on Hidden Markov Models (HMMs), or their general- 
ization. Dynamic Bayesian Networks (DBNs) |26|: a sequence 
of states and trajectories form a trajectory, and the coupling of 



trajectories and states is done using transition models tt {x\p) 
and TT {p\x). See Figure [s] for a representation. 

This results in a chain- structured directed probabilistic 
graphical model in which the path variables p^ are unob- 
served. Depending on the specifics of the transition models, 
TT and TT probabilistic inference has been done with 
Kalman filters 1281 . 1291 , the forward algorithm or the Viterbi 
algorithm (H, O, or particle filters ifTTl . 

Hidden Markov Model representations, however, suffer 
from the selection bias problem, first noted in the labeling 
of words sequences 1221 . which makes them not the best fit 
for solving path inference problems. Consider the example 
trajectory r = x^p^x'^ observed in our data, represented in 
Figure |9] For clarity, we consider only two states x\ and 
x\ associated with the GPS reading g^ and a single state 
x\ associated with g'^ . The paths {p]) ■ between x^ and x'^ 
may either be the lone path p\ from x\ to x\ that allows a 
vehicle to cross the Golden Gate Park, or one of the many 
paths between Cabrillo Street and Fulton Street that go from 
X2 to x^, including p\ and the HMM representation, 

the transition probabilities must sum to 1 when conditioned 
on a starting point. Since there is a single path from X2 to 
x^, the probability of taking this path from the state x\ will 
be TT = 1 so the overall probability of this path is 

TT {p\\g^) = TT Consider now the paths from x^ to 



xf: a lot of these paths will have a similar weight, since 
they correspond to different turns and across the lattice of 
streets. For each path p amongst these N paths of similar 
weight, B ayes' assumption implies tt [plx^) ~ so the 
overall probability of this path is tt (^p\g^) ^ -^tt {xHg^)- In 
this case, N can be large enough that tt {pi\g^) > tt {p\g^), 
and the remote path will be selected as the most likely path. 

Due to their structures, all HMM models will be biased 
towards states that have the least expansions. In the case of a 
road network, this can be pathological. In particular, the HMM 
assumption will carry the effect of the selection bias as long 
as there are long disconnected segments of road. This can be 
particularly troublesome in the case of road networks since 
HMM models will end up being forced to assign too much 
weight to a highway (which is highly disconnected) and not 
enough to the road network alongside the highway. Our model, 
which is based on Conditional Random Fields, does not have 
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Figure 9. Example of a failure case when using a Hidden Markov Model: 
the solid black path will be favored over all the other paths. 



this problem since the renormalization happens just once and 
is over all paths from start to end, rather than renormalizing 
for every single state transition independently. 

Efficient filtering algoritiims. Using the probabilistic 
framework of the CRF, we wish to infer: 

• the most likely trajectory r: 



r* = argmax tt {r\g 



1:T\ 



(1) 



• the posterior distributions over the elements of the tra- 
jectory, i.e. the conditional marginals ir (^x^\g^-^) and 

As will be seen, both elements can be computed without 
having to obtain the partition function Z. The solution to both 
problems is a particular case of the Junction Tree algorithm 
(261 and can be computed in time complexity linear in the 
time horizon by using a dynamic programing formulation. 
Computing the most likely trajectory is a particular instan- 
tiation of a standard dynamic programing algorithm called 
the Viterbi algorithm |14|. Using a classic Machine Learning 
algorithm for chain- structured junction trees (the forward- 
backward algorithm [|33i , [6\), all the conditional marginals 
can be computed in two passes over the variables. In the next 
section, we detail the justification for the Viterbi algorithm and 



in Section III-C we describe an efficient implementation of the 
forward-backward algorithm in the context of this application. 

B. Finding the most likely path 

For the rest of this section, we fix a sequence of observations 
g^'-^. For each observation g^, we consider a set of candidate 
state projections At each time step t G [1---T — 1], we 
consider a set of paths p^ so that each path from starts 
from some state G and ends at some state G x^+^. 
We will consider the set ^ of valid trajectories in the Cartesian 
space defined by these projections and these paths: 



r = x^p^ 



■p 



-'x^\ 



x' e x^ 
p^ e 
5{x\p') = 



1 



S{p\i 



In particular, if P is the number of candidate states associated 
with g^ (i.e. the cardinal of x^) and is the number of 
candidate paths in p\ then there are at most Ilf ~^ 



possible trajectories to consider. We will see, however, that 
most likely trajectory r* can be computed in 0(T/*J*) 
computations, with /* = max^ P and J* = max^ J^. 

The partition function Z does not depend on the current tra- 
jectory r and need not be computed when solving Equation [T] 

r* = argmax tt [r\g^'^) 

= argmax (j) {r\g^-^) 

Call (/)* {g^'^) the maximum value over all the potentials 
of the trajectories compatible with the observations g^'^: 

The trajectory r that realizes this maximum value is found 
by tracing back the computations. For example, some point- 
ers to the intermediate partial trajectories can be stored to 
trace back the complete trajectory, as done in the referring 
implementation Q. This is why we will only consider the 
computation of this maximum. The function <j) depends on the 
probability distributions uo and r], left undefined so far. These 



distributions will be presented in depth in Sections |III-D| and 

It is useful to introduce notation related to a partial trajec- 
tory. Call r^-^ the partial trajectory until time step t\ 



x^p^ 



For a partial trajectory, we define some partial potentials 
(j) (r^'^l^^-^) that depend only on the observations seen so far: 

t' = l 

•5"(/,x*'+i)c.(/+i|x*'+i) (2) 

For each time step t, given a state index i G [1,/^] we 
introduce the potential function for trajectories that end at the 



One sees: 



max 



The partial potentials defined in Equation ^ follow an induc- 
tive identity: 



t+i 



mgix 

ie[i,J*] 



(3) 



By using this identity, the maximum potential ^* can be 
computed efficiently from the partial maximum potentials ^f. 
The computation of the trajectory that realizes this maximum 
potential ensues by tracing back the computation to find which 
partial trajectory realized for each step t. 
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J=4 

(Unseen 
so far) 

t=3 



t=2 




t=l 



Figure 10. Example of case handled by lagged smoothing which disam- 
biguates the results provided by tracking. An observation is available close 
to an exit ramp of a highway, for which the algorithm has to decide if it 
corresponds to the vehicle exiting the highway. Lagged smoothing analyzes 
subsequent points in the trajectory and can disambiguate the situation. 



C. Trajectory filtering and smoothing 

We now turn our attention to the problem of computing the 
marginals of the posterior distributions over the trajectories, 
i.e. the probability distributions n {x^\g^'^) and {p^\g^''^) 
for all t. We introduce some additional notation to simplify 
the subsequent derivations. The posterior probability q\ of the 
vehicle being at the state x\ G at time given all the 
observations g^'-^ of the trajectory, is defined as: 



The operator ex indicates that this distribution is defined up to 
some scaling factor, which does not depend on x or p (but may 
depend on g^'^). Indeed, we are interested in the probabilistic 
weight of a state xj relative to the other possible states x 
at the state time t (and not to the actual, unsealed value 
of 7T [xl\g^-^)). This is why we consider {ql). as a choice 
between a (finite) set of discrete variables, one choice per 
possible state x-. A natural choice is to scale the distribution 
qj so that the probabilistic weight of all possibilities is equal 
to 1: ^ 

1 



E 

i<i</* 
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From a practical perspective, q^ can be computed without 
the knowledge of the partition function Z. Indeed, the only 
required elements are the unsealed values of tt {xl\g^-^) for 
each i. The distribution q^ = (^|) • is a multinomial distribution 
between choices, one for each state. The quantity g| has a 
clear meaning: it is the probability that the vehicle is in state 
when choosing amongst the set (^i/)i<^</t, given all the 
observations g^'-^ . 

For each time t and each path index j G [1 • • • J^], we also 
introduce (up to a scaling constant) the discrete distribution 
over the paths at time t given the observations g^'-^: 



which are scaled so that X]i<j< j* — ^^ 

This problem of smoothing in CRFs is a classic application 
of the Junction Tree algorithm to chain- structured graphs |26 |. 
For the sake of completeness, we derive an efficient smoothing 
algorithm using our notations. 

The definition of tt {xl\g^-^) requires summing the poten- 
tials of all the trajectories that pass through the state xj at 
time t. The key insight for efficient filtering or smoothing is 
to make use of the chain structure of the graph, which lets us 
factorize the summation into two terms, each of which can be 
computed much faster than the exponentially large summation. 
Indeed, one can show from the structure of the clique graph 
that the following holds for all time steps t: 



(4) 



The first term of the pair corresponds to the effect that the 
past and present observations (g^'*) have on our belief of the 
present state x^. The second term corresponds to the effect 
that the future observations (g^~^^'^) have on our estimation 
of the present state. The terms tt (^x^\g^-^) are related to each 
other by an equation that propagates /(9rwarJ in time, while the 
terms tt are related through an equation that goes 

backward in time. This is why we call tt (^x^\g^-^) the forward 

distribution for the states, and we denote itj^by ^ • 

The distribution ~^ I is proportional to the posterior probability 
7T (^xl\g^-^) and the vector = is normalized so 

that Y^l^i = I. We do this for the paths, by defining the 
forward distribution for the paths: 



1^] (XTT {p]\g 



Again, the distributions are defined up to a normalization 
factor so that each component sums to 1. 

In the same fashion, we introduce the backward distribu- 
tions for the states and the paths: 

Using this set of notations. Equation ^ can be rewritten: 



(X 



' 3 ' 3 



Furthermore, and are related through a pair of 
recursive equations: 

(X TT {x]\g'^) 



( 



~7^^j (X T] [p] 



\ 



(5) 



\:T\ 



^The arrow notation indicates that the computations for ~^ \ will be done 
forward in time. 
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(6) 



Similarly, the backward distributions can be defined recur- 
sively, starting from t = T: 

q i (xl 



r J OCT] [pj) 




E 



(7) 



(8) 



Details of the forward algorithm and backward algorithm 
are provided in the Algorithm [T] and Algorithm [2] below. The 
complete algorithm for smoothing is detailed in the Algorithm 
[3] below. 

Algorithm 1 Description of forward recursion 

Given a sequence of observations g^'^, a sequence of sets of 
candidate projections x^-^ and a sequence of sets of candidate 
paths p^-^~^: 

Initialize the forward state distribution: 

yi = l---I^:tl^u;{x}\g^) 

Normalize ~^ 
For every time step t from 1 to T — 1: 

Compute the forward probability over the paths: 
Vj = 1 . . . 

Normalize 1^^ 

Compute the forward probability over the states: 
Vi = 1 • • • /^+^ : 

Normalize 

Return the set of vectors (^~^^^ (^0 

The above smoothing algorithm requires all the observations 
of a trajectory in order to run. We have presented so far 
an a posteriori algorithm that requires full knowledge of 
measurements g^'-^ . In this form, it is not directly suitable for 
real-time applications that involve streaming data, for which 
the data is available up to t only. However, this algorithm can 
be adapted for a variety of scenarios: 

• Smoothing, also called offline filtering. This corresponds 
to getting the best estimate given all observations, i.e. to 
computing tt (^x^\g^-^) . The Algorithm [s] describes this 
procedure. 

• Tracking, filtering, or online estimation. This usage cor- 
responds to updating the current state of the vehicle as 
soon as a new streaming observation is available, i.e. 
to computing n (^x^\g^-^) . This is exactly the case the 
forward algorithm (Algorithm [T]) is set to solve. If one is 



Algorithm 2 Description of backward recursion 

Given a sequence of observations g^'^ , a sequence of sets of 
candidate projections x^-^ and a sequence of sets of candidate 
paths p^-^~^: 

Initialize the backward state distribution 

vi=i---/^: vr^i 

For every time step t from T — 1 to 1 : 

Compute the forward probability over the paths: 
Vj = 1 • • • 



Normalize 

Compute the forward probability over the states: 

Normalize V 
Return the set of vectors V^^ ^nd (^^^) 



Algorithm 3 Trajectory smoothing algorithm 

Given a sequence of observations g^'^, a sequence of sets of 
candidate projections x^-^ and a sequence of sets of candidate 
paths p^-^~^: 

Compute (^~^*^ ^^^^S ^^^^^^^ 

Compute V^) and (^^^) using the backward algorithm. 
For every time step t: 



Vj = 1 • • • r] 
Normalize 

Normalize 
Return the set of vectors {q^)^ and (f^) 



simply interested in the most recent estimate, then only 
the previous forward distribution needs to be kept, 
and all distributions • • • at previous times can 

be discarded. This application minimizes the latency and 
the computations at the expense of the accuracy. 
Lagged smoothing, or lagged filtering. A few points of 
data are stored and processed before returning a result. 
Algorithm [4] details this procedure, which involves com- 
puting TT for some /c > 0. A trade-off is 
being made between the latency and the accuracy, as the 
information from the points is used to update the 
estimate of the state x^. As shown in Section VI even for 



small values of k, such a procedure can bring significant 
improvements in the accuracy while keeping the latency 
within reasonable bounds. A common ambiguity solved 



by lagged smoothing is presented in Figure 10 



D. Observation model 

We now describe the observation model uj. The observation 
probability is assumed only to depend on the distance between 
the point and the GPS coordinates. We take an isoradial 
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Algorithm 4 Lagged smoothing algorithm 

Given an integer k > 0, and a LIFO queue of observations: 

InitiaHze the queue to the empty queue. 

When receiving a new observation g^: 
Push the observation in the queue 
Run the forward filter on this observation 
lft>k\ 

Run the backward filter on the queue 



IV. Training procedure 

The procedure detailed so far requires the calibration of 
the observation model and the path selection model by set- 
ting some values for the weight vector /i and the standard 
deviation a. Using standard machine learning techniques, 
we maximize the likelihood of the observations with respect 
to the parameters, and we evaluate the result against held- 



Compute q 

Pop the queue and return q^~^ and f 



on the first element of the queue out trajectories using several metrics detailed in Section [VI 



^t-k 



Gaussian noise model: 



u;{g\x) = p{d{g,x)) 
1 



27rcr 



in which the function d is the distance function between geoco- 
ordinates. The standard deviation a is assumed to be constant 
over all the network. This is not true in practice because of 
well documented urban canyoning effects 1 10], 1361 , ITtII and 
satellite occlusions. Updating the model accordingly presents 
no fundamental difficulty, and can be done by geographical 
clustering of the regions of interest. Using the estimation 



techniques described later in Section IV-C and Section IV-D 



an estimate of a between 10 and 15 meters could be estimated 
for data of interest in this article. 

E. Driver model 

The second model to consider is the driver behavior model. 
This model assigns a weight to any acceptable path on the 
road network. We consider a model in the exponential family, 
in which the weight distribution over any path p only depends 
on a selected number of features {p) G of the path. 
Possible features include the length of the path, the number of 
stop signs, and the speed limits on the road. The distribution 
is parametrized by a vector /i G so that the logarithm of 
the distribution of paths is a linear combination of the features 
of the path: 

T] (p) (X exp {j^^if {p)) 

The function (p is called the feature function, and the vector /i 
is called the behavioral parameter vector, and simply encodes 
a weighted combination of the features. 

In a simple model the vector (p (p) may be reduced to a 
single scalar, such as the length of the path. Then the inverse 
of /i, a length, can be interpreted as a characteristic length. 
This model simply states that the driver has a preference for 
shorter paths, and indicates how aggressively this driver 
wants to follow the shortest path. Such a model is explored in 
Section ( fVll ). Other models considered include the mean speed 
and travel times, the stop signs and signals, and the turns to 
the right or to the left. 

In the Mobile Millennium system, the path inference filter 
is the input to a model designed to learn travel times, so the 
feature function does not include dynamic features such as the 
current travel time. Assuming this information is available, it 
would be easy to add as a feature. 



Computing likelihood will require the computation of the 
partition function (which depends on /i and a). We first present 
a procedure that is valid for any path or point distributions 
that belong to the exponential family, and then show how the 



models we presented in Section III fit into this framework 



A. Learning within the exponential family and sparse trajec- 
tories 

There is a striking similarity between the state variables 



and the path variables p 



1:T 



especially between the 



forward and backward distributions introduced in Equation ([5]). 
This suggests to generalize our procedure to a context larger 
than states interleaved with paths. Indeed, each step of choos- 
ing a path or a variable corresponds to making a choice 
between a finite number of possibilities, and there is a limited 
number of pairwise compatible choices (as encoded by the 
functions 5_ and 5). Following a trajectory corresponds to 
choosing a new state (subject to the compatibility constraints 
of the previous state). In this section, we introduce the proper 
notation to generalize our learning problem, and then show 
how this learning problem can be efficiently solved. In the 
next section, we will describe the relation between the new 
variables we are going to introduce and the parameters of our 
model. 

Consider a joint sequence of multinomial random variables 
z^-^ = • • • drawn from some space Y^=i {l • • • 
where is the dimensionality of the multinomial variable 
7}. Given a realization z^'^ from t}'-^ , we define a non- 
negative potential function ijj (^z^-^^ over the sequence of 
variables. This potential function is controlled by a parameter 
vector e e R^: i/j {z^'-^) = {z^'-^;0) ^ Furthermore, we 
assume that this potential function is also defined and non- 
negative over any subsequence i/j^z^'-^). Lastly, we assume 
that there exists at least one sequence z^'-^ that has a positive 
potential. As in the previous section, the potential function ip, 
when properly normalized, defines a probability distribution 
of density tt over the variables z, and this distribution is 
parametrized by the vector 6\ 



7r{z;0) = 



Z{0) 



(9) 



with Z = "^^i^ {z;6) called the partition function. We 
will show the partition function defined here is the partition 
function introduced in Section IIII-CI 

^The semicolon notation indicates that this function is parametrized by 6, 
but that 6 is not a random variable. 
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We assume that ij) is an unsealed member of the exponential 
family, it is of the form: 



1 



(10) 



^=1 



In this representation, is a non-negative function of z which 
does not depend on the parameters, the operator • is the vector 
dot product, and the vectors (z^) are vector mappings from 
the reaHzation z^ to for some M G N, called feature 
vectors. Since the variable z^ is discrete and takes on values 
in { 1 ••• iiT^ }, it is convenient to have a specific notation for 
the feature vector associated with each value of this variable: 

Mi e {l---K^},Tl =T^ {z^ =i) 

The sequence of variables z represents the choices asso- 
ciated with a single trajectory, i.e. the concatenation of the 
xs and ps. In general, we will observe and would like to 
learn from multiple trajectories at the same time. This is why 
we need to consider a collection of variables (z*^^^)^, each 
of which follows the form above and each of which we can 
define a potential (z*^^^; O) and a partition function Z*^^^ {0) 
for. There the variable u indexes the set of sequences of 
observations, i.e. the set of consecutive GPS measurements 
of a vehicle. Since each of these trajectories will take place 
on a different portion of the road network, each of the 
sequences z*^^^ will have a different state space. For each of 
these sequences of variables z^'^\ we observe the respective 
realizations z*^^^ (which correspond to the observation of a 
trajectory), and we wish to infer the parameter vector 6>* 
that maximizes the likelihood of all the realizations of the 
trajectories: 

r =argmax^log7r(^) (z^^);6>) 

= argmax^logV(z(");^) -logZ(")(^) 

= arg max ^ ^ l9 • T^^^^ (z^^^^ ) - log Z^^^ ((9) 

where again the indexing u is for sets of measurements of 
a given trajectory. Similarly, the length of a trajectory is 
indexed by w. L^^\ From Equation [TI] it is clear that the 
log-likelihood function simply sums together the respective 
likelihood functions of each trajectory. For clarity, we consider 
a single sequence z^^^ only and we remove the indexing with 
respect to u. With this simplification, we have for a single 
trajectory: 

L 

log (z; 0) - log Z (i9) = ^ (9 • {z^) - log Z (0) 

(12) 

The first part of Equation ([12]) is linear with respect to 
and log Z (0) is concave in (it is the logarithm of a sum 
of exponentiated linear combinations of 6> Q) . As such, 
maximizing Equation (12) yields a unique solution (assuming 
no singular parametrization), and some superlinear algorithms 
exist to solve this equation |7|. These algorithms rely on 



the computation of the gradient and the Hessian matrix of 
log Z {0). We now detail some closed-form recursive formulas 
to compute these elements. 

1 ) Efficient estimation of the partition function: A naive 
approach to the computation of the partition function Z (0) 
leads to consider exponentially many paths. Most of these 
computations can be factored using dynamic programming |^ 
Recall the definition of the partition function: 

z 1=1 

So far, the function h was defined in in a generic way (it is non- 
negative and does not depend on 0). We consider a particular 
shape that generalizes the functions S and S introduced in the 
previous section. In particular, the function h is assumed to be 
a binary function, from the Cartesian space fl^i {l • • • 
to {0, 1}, that decomposes to the product of binary functions 
over consecutive pairs of variables: 

L-l 



h{z)=Y[ {z\z^-^) 



1=1 

in which every function is a binary indicator : 
{l---K^} X {1 • • • K^-^} _{0, 1}. These functions 
generalize the functions S and S for arguments z equal to 
either the xs or the ps. It indicates the compatibility of the 
values of the instantiations z^ and z^~^ 

Finally, we introduce the following notation. For each index 
/ G [1 • • • L] and subindex i e [I ■ ■ ■ K^], call Z- the partial 
summation of all partial paths z^-^ that terminate at the value 



ziio) 



m=l 
I 



'.r'"(z'") 



m=2 



This partial summation can also be defined recursively: 

je[l...Ki-^]:h\z\z3) = l 

The start of the recursion is for alH G {l • • • K^}: 
Z} (9) = e<^-^l 

and the complete partition function is the summation of the 
auxiliary values: 

Z{0) = ^Zt{0) 

i=l 

Computing the partition function can be done in polynomial 
time complexity by a simple application of dynamic program- 
ming. By using sparse data structures to implement h, some 
additional savings in computations can be mad^ 

^This is - again - a specific application of the junction tree algorithm. See 
(26 ) for an explanation of the general framework. 

^In particular, care should be taken to implement all the relevant computa- 
tions in log-domain due to the limited precision of floating point arithmetic 
on computers. The reference implementation 1 1 1 shows one way to do it. 
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2) Estimation of the gradient: The estimation of the gra- 
dient for the first part of the log HkeHhood function is 
straightforward. The gradient of the partition function can also 
be computed using Equation ([TSj: 



VbZ\ (9) = Z\ {0) Tl + e"-^. ^^^Y' (^) 

The Hessian matrix can be evaluated in similar fashion: 



and similarly, that the observed state x\^^^^^^^ is one of the 
possible states: 



3ie [I---/*] 



^observed 



In this case, the values of and are known (they 
are the matching indexes), and the optimization problem 
of Equation ^H) can be solved using methods outlined in 
Section lESl 



l\eeZ\ {0) =Zl {0) [tI) [tI) 



J2 v,zj-i(^) (t;')' 



\j:h\z\z3) = l 



\j:h\z\z3) = l 
j:hi{z\zj) = l 

B. Exponential family models 

We now express our formulation of Conditional Random 
Fields to a form compatible with Equation ([TO]). 

Consider e = cr~^ and 6 the stacked vector of the desired 
parameters: 

e 
/i 

There is a direct correspondence between the path and state 
variables with the z variables introduced above. Let us pose 
L = 2T - 1, then for all / G [1, L] we have: 

and the feature vectors are simply the alternating values of 
(p and d, completed by some zero values: 







'2t-l 








These formulas establish how we can transform our learning 
problem that involves paths and states into a more abstract 
problem that considers a single set of variables. 

C. Supervised learning with known trajectories 

The most straightforward way to learn /i and a, or equiv- 
alently to learn the joint vector 0, is to maximize the likeli- 
hood of some GPS observations g^'^ , knowing the complete 
trajectory followed by the vehicle. For all time t, we also 
know which path ^observed taken and which state ^observed 
produced the GPS observation g^. We make the assumption 
that the observed path ^observed possible path 

amongst the set of candidate paths (p^ ) . : 



D. Unsupervised learning with incomplete observations: 
Expectation-Maximization 

Usually, only the GPS observations g^'-^ are available; the 
values of r^-^~^ and q^'-^ (and thus z^'-^) are hidden to us. In 
this case, we estimate the expected likelihood C, which is the 
expected value of the likelihood under the distribution over 
the assignment variables z^-^: 

C{0) = E,^,(.|,)[log(^(z;^))] (14) 
= ^^(z;^)log(^(z;^)) (15) 

z 

The intuition behind this expression is quite natural: since 
we do not know the value of the assignment variable z, we 
consider the expectation of the likelihood over this variable. 
This expectation is done with respect to the distribution 
7t{z;0). The challenge lies in the dependency in of the 
very distribution used to take the expectation. Computing the 
expected likelihood becomes much more complicated than 
simply solving the optimization problem of ( pT| ). 

One strategy is to find some "fill in" values for z that would 
correspond to our guesses of which path was taken, and which 
point made the observation. However, such a guess would 
likely involve our model for the data, which we are currently 
trying to learn. A solution to this chicken and egg problem 
is the Expectation Maximization (EM) algorithm |25|. This 
algorithm performs an iterative projection ascent by assigning 
some distributions (rather than singular values) to every z\ 
and uses these distributions to updates the parameters /i and 



a using the procedures seen in Section |IV-C| This iterative 
procedure performs two steps: 

1) Fixing some value for 6, it computes a distribution 
n {z) = TT {z; 0) 

2) It then uses this distribution tt (z) to compute some new 
value of 6 by solving the approximate problem in which 
the expectation is fixed with respect to 0: 



maxE^^^(.) [log (tt (z; 9))] 



(16) 



3jG [1---J'] : P\ 



observed 



This problem is significantly simpler than the optimiza- 
tion problem in Equation ([14]) since the expectation 
itself does not depend on 9 and thus is not part of the 
optimization problem. 
Under this procedure, the expected likelihood is shown to 
converge to a local maximum |26|. It can be shown that good 
values for the plug-in distribution tt are simply the values of 
the posterior distributions tt {p^\g^'^) and tt i.e. the 

values q^ and f^ Furthermore, owing to the particular shape 
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Algorithm 5 Expectation maximization algorithm for learning 
parameters without complete observations. 
Given a set of sequences of observations, an initial value of 6 
Repeat until convergence: 

For each sequence, compute and using Algorithm [s] 
For each sequence, update expected values of using 
^ and ([18]). 

Compute a solution of Problem ( pTj ) using these new values 
of TK 



of the distribution 7r(z), taking the expectation is a simple 
task: we simply replace the value of the feature vector by its 
expected value under the distribution tt (2:). More practically, 
we simply have to consider: 









(17) 



in which 



and 



^2t-l / 2.-1 



(18) 



so that 



J' 



These values of feature vectors plug directly into the super- 
vised learning problem in Equation ( pTj ) and produce updated 
parameters /i and a, which are then used in turn for updating 
the values of q and f and so on. 

V. Results from field operational test 

The path inference filter and its learning procedures were 
tested using field data through the Mobile Millennium system. 
Ten San Francisco taxicabs were fit with high frequency GPS 
(1 second sampling rate) in October 2010 during a two- 
day experiment. Together, they collected about seven hundred 
thousand measurement points that provided a high-accuracy 
ground truth. Additionally, the unsupervised learning filtering 
was tested on a significantly larger dataset: one day one-minute 
samples of 600 taxis from the same fleet, which represents 600 
000 points. For technical reasons, the two datasets could not 
be collected the same day, but were collected the same day 
of the week (a Wednesday) three weeks prior to the high- 
frequency collection campaign. Even if the GPS equipment 
was different, both datasets presented the same distribution 
of GPS dispersion. Thus we evaluate two datasets collected 
from the same source with the same spatial features: a smaller 
set at high frequency, called "Dataset 1", and a larger dataset 
sampled at 1 minute for which we do not know ground truth, 
called "Dataset 2". 




Figure 11. Example of points collected in "Dataset 1", in the Russian Hill 
neighborhood in San Francisco. The (red) dots are the GPS observations 
(collected every second), and the green lines are road links that contain a 
state projection. The black lines show the most likely projection of the GPS 
points on the road network, using the Viterbi algorithm on a gridded state- 
space with a 1 -meter grid for the offsets. 



Algorithm 6 Evaluation procedure 

Given a set of high-frequency sequences of raw GPS data: 

1) Map the raw high-frequency sequences on the road 
network 

2) Run the Viterbi algorithm with default settings 

3) Extract the most likely HE trajectory on the road net- 
work for each sequence 

4) Given a set of projected HE trajectories: 

a) Decimate the trajectories to a given sampling rate 

b) Separate the set into a training subset and a test 
subset 

c) Compute the best model parameters for a number 
of learning methods (most likely, EM with a simple 
model or a more complex model) 

d) Evaluate the model parameters with respect to dif- 
ferent computing strategies (Viterbi, online, offline, 
lagged smoothing) on the test subset 



A. Experiment design 

The testing procedure is described in Algorithm |6] the 
filter was first run in trajectory reconstruction mode (Viterbi 
algorithm) with settings and-tuned for a high-frequency appli- 
cation, using all the samples, in order to build a set of ground 
truth trajectories. The trajectories were then downsampled to 
different temporal resolutions and were used to test the filter 
in different configurations. The following features were tested: 

• The sampling rate. The following values were tested: 1 
second, 10 seconds, 30 seconds, one minute, one and a 
half minute and two minutes 

• The computing strategy: pure filtering ("online" or for- 
ward filtering), fixed-lagged smoothing with a one- or 
two-point buffer ("1-lag" and "2-lag" strategies), Viterbi 
and smoothing ("offline", or forward-backward proce- 
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dure). 
• Different models: 

- "Hard closest point": A greedy deterministic model 
that computes the closest point and then finds the 
shortest path to reach this closest point from the 
previous point. This non-probabilistic model is the 
baseline against which we make comparison on 
[16 1 . This greedy model may lead to non-feasible 
trajectories, for example by assigning an observation 
to a dead end link from which it cannot recover. 

- "Closest point" : A non-greedy version of "Hard 
closest point". Among all the feasible trajectories, 
this (naive, deterministic) model projects all the GPS 
data to their closest projections and then selects the 
shortest path between each projection. The comput- 
ing strategy chosen is important because the filter 
may determine that some projections lead to dead 
end and force the trajectory to break. 

- "Shortest path": A naive model that selects the 
shortest path. Given paths of the same length, it will 
take the path leading to the closest point. The points 
projections are then recovered from the paths. This 
is similar to f40l. 

- "Simple" A simple model that considers two features 
that could be tuned by hand: 

1) ^1 : The length of the path 

2) ^2 The distance of a point projection to its GPS 
coordinate 

This model was trained on learning data by two 
procedures: 

* Supervised learning, in which the true trajectory 
is provided to the learning algorithm leading to 
the "MaxLL-Simple" model 

* Unsupervised learning, which produced the model 
called "EM-Simple" 

- "Complex" : A more complex model with a more 
diverse set of features, which is complicated enough 
to discourage manual tuning: 

1) The length of the path 

2) The number of stop signs along the path 

3) The number of signals (red lights) 

4) The number of left turns made by the vehicle at 
road intersections 

5) The number of right turns made by the vehicle 
at road intersections 

6) The minimum average travel time (based on the 
speed limit) 

7) The maximum average speed 

8) The maximum number of lanes (representative of 
the class of the road) 

9) The minimum number of lanes 

10) The distance of a point to its GPS point 
This model was first evaluated using supervised 
learning leading to the model called "MaxLL- 
Complex". The unsupervised learning procedure was 
also tried but failed to properly converge when using 
"Dataset 1", obtained from high-frequency samples. 
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Unsupervised learning was run again with "Dataset 
2", using the simple model as a start point and 
converged properly this time. This set of parameters 
is presented under the label "EM-Complex". 
All the models above are specific cases of our framework: 

• "Simple" is a specific case of "Complex", by restricting 
the complex model to only two features. 

• "Shortest path" is a specific case of "Simple" with |^i 
1, 161 < 1. We used ^ = -1000 and ^ = -0.001 

• "Closest point" is a specific case of "Simple" with 
1, 161 > 1. We used 6 = -0.001 and 6 = -1000 

• "Hard closest point" can be reasonably approximated 
by running the "Closest point" model with the Online 
filtering strategy. 

Thanks to this observation, we implemented all the model 
using the same code and simply changed the set of features 
and the parameters 11). 

These models were evaluated under a number of metrics: 

• The proportion of path misses: for each trajectory, it is 
the number of times the most likely path was not the true 
path followed, divided by the number of time steps in the 
trajectory. 

• The proportion of state misses: for each trajectory, the 
number of times the most likely projection was not the 
true projection. 

• The log-likelihood of the true point projection. This is 
indicative of how often the true point is identified by the 
model. 

• The log-likelihood of the true path. 

• The entropy of the path distribution and of the point dis- 
tribution. This statistical measure indicates the confidence 
assigned by the filter to its result. A small entropy (close 
to 0) indicates that one path is strongly favored by the 
filter against all the other ones, whereas a large entropy 
indicates that all paths are equal. 

• The miscoverage of the route. Given two paths p and 
p' the coverage of p by p\ denoted coy {p^p') is the 
amount of length of p that is shared with p' (it is a 
semi-distance since it is not symmetric). It is thus lower 
than the total length \p\ of the path p. We measure the 
dissimilarity of two paths by the relative miscoverage: 
mc {p) = 1 — ^^^i^^^i'-^^ . If a path is perfectly covered, its 
relative miscoverage will be 0. 

For about 0.06% of pairs of points, the true path could not be 
found by the A* algorithm and was manually added to the set 
of discovered paths 

Each training session was evaluated with k-fold cross- 
validation, using the following parameters: 
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Figure 12. Point misses using trajectory reconstruction (Viterbi algorithm) 
for different sampling rates, as a percentage of incorrect point reconstructions 
for each trajectory (positive, smaller is better). The solid line denotes the 
median, the squares denote the mean and the dashed lines denote the 
94% confidence interval. The black curve is the performance of a greedy 
reconstruction algorithm, and the colored plots are the performances of 
probabilistic algorithms for different features and weights learned by different 
methods. As expected, the error rate is close to for high frequencies (low 
sampling rates): all the points are correctly identified by all the algorithms. 
In the low frequencies (high sampling rates), the error still stays low (around 
10%) for the probabilistic models, and also for the greedy model. For sampling 
rates between 10 seconds and 90 seconds, tuned models show a much higher 
performance compared to greedy models (Hard closest point, closest point and 
shortest path). However, we will see that the errors made by tuned models 
are more benign than errors made by simple greedy models. 



B. Results 

Given the number of parameters to adjust, we only present 
the most saHent results here. 

The most important practical result is the raw accuracy of 
the filter: for each trajectory, which proportion of the paths 
or of the points was correctly identified? These results are 



presented in Figure [12] and Figure [13] As expected, the error 
rate is for high frequencies (low sampling period): all the 
points are correctly identified by all the algorithms. In the 
low frequencies (high sampling periods), the error is still low 
(around 10%) for the trained models, and also for the greedy 
model ("Hard closest point"). For sampling rates between 
10 seconds and 90 seconds, trained models ("Simple" and 
"Complex") show a much higher performance compared to 
untrained models ("Hard closest point", "Closest point" and 
"Shortest path"). 

We now turn our attention to the resilience of the models, 
i.e. how they perform when they make mistakes. We use two 
statistical measures: the (log) likelihood of the true paths (Fig- 
ure [14]) and the entropy of the distribution of points or paths 



(Figures [15] and [16]). Note that in a perfect reconstruction with 
no ambiguity, the log likelihood would be zero. Interestingly, 
the log likelihoods appear very stable as the sampling interval 
grows: our algorithm will continue to assign high probabilities 
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Figure 13. Path misses using the Viterbi reconstruction for different models 
and different sampling rates, as a percentage on each trajectory (lower is 
better). The solid line denotes the median, the squares denote the mean and 
the dashed lines denote the 98% percentiles. The error rate is close to for 
high frequencies: the paths are correctly identified. In higher sampling regions, 
there are many more paths to consider and the error increases substantially. 
Nevertheless, the probabilistic models still perform very well: even at 2 minute 
intervals, they are able to recover about 75% of the true paths. In particular, in 
these regions the shortest path becomes a viable choice for most paths. Note 
how the greedy path reconstruction fails rapidly as the sampling increases. 
Also note how the shortest path heuristic performs poorly. 



to the true projections even when many more paths can be used 
to travel from one point to the other. The performance of the 
simple and the complex models improves greatly when some 
backward filtering steps are used, and stays relatively even 
across different time intervals. 

We conclude the performance analysis by a discussion of 
the miscoverage (Figure [TT]). The miscoverage gives a good 
indication of how far the path chosen by the filter differs from 
the true path. Even if the paths are not exactly the same, 
some very similar path may get selected, that may differ 
by a turn around a block. Note that the metric is based on 
length covered. At high frequency however, the vehicle may be 
stopped and cover a length 0. This metric is thus less useful at 
high frequency. A more complex model improves the coverage 
by about 15% in smoothing. In high sampling resolution, the 
error is close to zero: the paths considered by the filter, even if 
they do not match perfectly, are very close to the true trajectory 
for lower frequencies. Two groups clearly emerge as far as 
computing strategies are concerned: the online/1 -lag group 
(orange and red curves) and the 2-lag and offline group (green 
and blue curves). The relative miscoverage for the latter group 
is so low that more than half of the probability mass is at zero. 
A number of outliers still raise the curve of the last quartile 
as well as the mean, especially in the lower frequencies. The 
paths inferred by the filter are never dramatically different: 
at two minute time intervals (for which the paths are 1.7km 
on average), the returned path spans more than 80% of the 
true path on average. The use of a more complicated model 
decreases the mean miscoverage as well as all quartile metrics 
by more than 15%. 
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Figure 14. (Negative of) Log likelihood of true paths for different strategies 
and different sampling rates (positive, lower is better). The error bars denote 
the first and last quartiles (the 25th and 75th percentiles). The solid line 
denotes the median, the squares denote the mean and the dashed lines denote 
the 98% confidence interval. The likelihood decreases as the sampling interval 
increases, which was to be expected. Note the relatively high mean likelihood 
compared to the median : a number of true paths are assigned very low 
likelihood by the model, but this phenomenon is mitigated by using better 
filtering strategies (2-lagged and smoothing). The use of a more complex 
model (that accounts for a finer set of features for each path) brings some 
improvements on the order of 25% of all metrics. The behavior around high 
frequencies (1 and 10 second time intervals) is also very interesting. Most 
of the paths are chosen nearly perfectly (the median is 0), but the filters 
are generally too confident and assign very low probabilities to their outputs, 
which is why the likelihood has a very heavy tail at high frequency. Note also 
that in the case of high frequency, the use of an offline filter brings significantly 
more accurate results than a 2-lagged filter. This difference disappears rapidly 
(it becomes insignificant at 10 second intervals). Note how the EM trained 
filter performs worse in the low frequencies (note the difference of scale). The 
points for online strategy (red) and for 2-lagged filtering (green) do not appear 
because they are too close to the 1 -lagged and offline strategies, respectively. 
Again in the EM setting, the offline and 2-lagged filters perform considerably 
better than the cruder strategies. 



In the case of the complex model, the weights can provide 
some insight into the features involved in the decision-making 
process of the driver. In particular, for extended sampling rates 
(t=120s), some interesting patterns appear. For example, the 
drivers do not show a preference between driving through stop 
signs (W'^ = —0.24 ±0.07) or through signals {w/^ = —0.21 ± 
0.11). However, drivers show a clear preference to turn on 

This 



the right as opposed to the left, as seen in Figure 18 
is may be attributed, in part, to the difficulty in crossing an 
intersection in the United States. 

From a computation perspective, given a driver model, the 
filtering algorithm can be dramatically improved for about 
as much computations by using a full backward-forward 
(smoothing) filter. Smoothing requires backing up an arbitrary 
sequence of points while 2-lagged smoothing only requires 
the last two points. For a slightly greater computing cost, the 
filter can offer a solution with a lag of one or two interval time 
units that is very close to the full smoothing solution. Fixed- 
lag smoothing will be the recommended solution for practical 
applications, as it strikes a good balance of computation costs, 
accuracy and timeliness of the results. 

It should be noted the algorithm continues to provide decent 



Figure 15. Distributions of point entropies with respect to sampling and 
for different models. The colors show the performance of different filtering 
strategies (pure online, 1-lag, 2-lag and offline). The entropy is a measure 
of the confidence of the filter on its output and quantifies the spread of the 
probability distribution over all the candidate points. The solid line denotes 
the median, the squares denote the mean and the dashed lines denote the 
95% confidence interval. The entropy starts at nearly zero for high frequency 
sampling : the filters are very confident in their outputs. As sampling time 
increases, the entropy at the output of the online filter increases notably. Since 
the online filter cannot go back to update its belief, it is limited to pure forward 
prediction and as such cannot confidently choose a trajectory that would work 
in all settings. For the other filtering strategies, the median is close to zero 
while the mean is substantially higher. Indeed, the filter is very confident 
in its output most of the time and assigns a weight of nearly one to one 
candidate, and nearly zero to all the other outputs, but it is uncertain in a few 
cases. These few cases are at the origin of the fat tail of the distributions of 
entropies and the relatively wide confidence intervals. Note that using a more 
complex model improves the mean entropy by about 15%. Also, in the case 
of EM, the entropy is very low (note the difference of scale): the EM model 
is overconfident in its predictions and tends to assigns very large weights to 
a single choice, even if it not the good one. 



results even when points grow further apart. The errors steadily 
increase with the sampling rate until the 30 seconds time 
interval, after which most metrics reach some plateau. This 
algorithm could be used in tracking solutions to improve the 
battery life of the device by up to an order of magnitude for 
GPSs that do not need extensive warm up. In particular, the 
tracking devices of fleet vehicle are usually designed to emit 
every minute as the road-level accuracy is not a concern in 
most cases. 

C. Unsupervised learning results 

The filter was also trained for the simple and complex 
models using Dataset 2. This dataset does not include true 
observations but is two orders of magnitude larger than Dataset 
1 for the matching sampling period (1 minute). We report 
some comparisons between the models previously trained 
with Dataset 1 ("MaxLL-Simple", "EM- Simple", "MaxLL- 
Complex") and the same simple and complex models trained 
on Dataset 2: "EM- Simple large" and "EM-Complex large". 
The learning procedure was calibrated using cross-validation 
and was run in the following way: all unsupervised models 
were initialized with a hand-tuned heuristic model involving 
only the standard deviation and the characteristic length (with 
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Figure 16. Distributions of path entropies with respect to sampling period 
and for different models (positive, lower is better). The colors show the 
performance of different filtering strategies (purely online, 1-lag, 2-lag and 
offline) The entropy is a measure of the confidence of the filter on its output 
and quantifies the spread of the probability distribution over all the candidate 
paths. The solid line denotes the median, the squares denote the mean and 
the dashed lines denote the 95% confidence interval. Compared to the points, 
the paths distributions have a higher entropy: the filter is much less confident 
in choosing a single path and spreads the probability weights across several 
choices. Again, the use of 2-lagged smoothing is as good as pure offline 
smoothing, for the same computing cost and a fraction of the data. Online 
and 1 -lagged smoothing perform about as well, and definitely worse than 2- 
lagged smoothing. The use of a more complex model strongly improves the 
performance of the filter: it results in more compact distribution over candidate 
paths. Again, the model learned with EM is overconfident and tends to offer 
favor a single choice, except for a few path distributions. 



the weight of all the features set to 0). The Expectation- 
Maximization algorithm was then run for 3 iterations. Inside 
each EM iteration, the M-step was run with a single Newton- 
Raphson iteration at each time, using the full gradient and 
Hessian and a quadratic penalty of 10~^. During the E step, 
each sweep over the data took 13 hours 400 thousand points 
on a 32-core Intel Xeon server. 

We limit our discussion to the main findings for brevity. 
The unsupervised training finds some weight values similar to 
those found with supervised learning. The magnitude of these 
weights is larger than in the supervised settings. Indeed, during 
the E step, the algorithm is free to assign any sensible value 
to the choice of the path. This may lead to a self-reinforcing 
behavior and the exploration of a bad local minimum. 

As Figures [2T| [23] and [22] show, a large training dataset puts 
unsupervised methods on par with supervised methods as far 
as performance metrics are concerned. Also, the inspection 
of the parameters learned on this dataset corroborates the 
finding made earlier. One is tempted to conclude that given 
enough observations, there no need to collect expensive high- 
frequency data to train a model. 



D. Key findings 

Our algorithm can reconstruct a sensible approximation 
of the trajectory followed by the vehicles analyzed, even 
in complex urban environments. In particular, the following 
conclusions can be made: 
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Figure 17. Distribution of relative miscoverage of the paths (between and 
1, lower is better). The solid line denotes the median, the squares denote the 
mean and the dashed lines denote the 98% confidence interval. This metric 
evaluates how much of the true path the most likely path covers , with respect 
to length (0 if it is completely different, 1 if the two paths overlap completely). 
Two groups clearly emerge as far as computing strategies are concerned: the 
online/ 1-lag group (orange and red curves) and the 2-lag and offline group 
(green and blue curves). The relative miscoverage for the latter group is so 
low that more than half of the mass is at the and cannot be seen on the 
curve. There are still a number of outliers that raise the curve of the last 
quartile as well as the mean, especially in the lower frequencies. Note that the 
paths offered by the filter are never dramatically different: at two minute time 
intervals (for which the paths are 1.7km on average), the returned path spans 
more than 80% of the true path on average. The use of a more complicated 
model decreases the mean miscoverage as well as the quartile metrics by more 
than 15%. Note that there is a large spread of values at high frequency: indeed 
the metric is based on length covered and at high frequency, the vehicle may 
be stopped and cover length. This metric is thus less indicative at high 
frequency. 




Figure 18. Learned weights for left or right turns preferences. The error bars 
indicate the complete span of values computed for each time (0th and 100th 
percentile). For small time intervals, any turning gets penalized but rapidly 
the model learns how to favor paths with right turns against paths with left 
turns. A positive weight even means that - all other factors being equal! - the 
driver would prefer turning on the right than going straight. 
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Figure 19. Standard deviation learned by the simple models, in the supervised 
(Maximum Likelihood) setting and the EM setting. The error bars indicate 
the complete span of values computed for each time. Note that the maximum 
likelihood estimator rapidly converges toward a fixed value of about 6 meters 
across any sampling time. The EM procedure also rapidly converges, but it 
is overconfident and assigns a lower standard deviation overall. 
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Figure 20. Characteristic length learned by the simple models, in the 
supervised (Maximum Likelihood) setting and the EM setting. As hoped, it 
roughly corresponds to the expected path length. The error bars indicate the 
complete span of values computed for each time (0th and 100th percentile). 
Note how the spread increases for large time intervals. Indeed, vehicles have 
different travel lengths at such time intervals, ranging from nearly (when 
waiting at a signal) to more than 3 km (on the highway) and the models 
struggle to accommodate a single characteristic length. This justifies the use 
of more complicated models. 



Figure 22. Proportion of true points incorrectly identified, for different 
models evaluated with 1-minute sampling (lower is better). The central point 
is the mean proportion, the error bars indicate the 70% confidence interval. 
Unsupervised models are very competitive against supervised models, and the 
complex unsupervised model slightly outperforms all supervised models. 



Comparison of path assignment errors for 1-minute sampling intervals 
(Viterbi reconstruction) 
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Comparison of the log-likelihoods of the true paths for 1-minute sampling intervals 
(2-lagged smoothing reconstruction) 
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Figure 21. Expected likelihood of the true path. The central point is the 
mean log-likelihood, the error bars indicate the 70% confidence interval. Note 
that the simple model trained unsupervised with the small dataset has a much 
larger error, i.e. it assigns low probabilities to the true path. Both unsupervised 
models tend to express the same behavior but are much more robust. 



Figure 23. Proportion of true paths incorrectly identified, for different models 
evaluated with 1-minute sampling (lower is better). The central point is the 
mean proportion, the error bars indicate the 70% confidence interval. The 
complex unsupervised model is as good as the best supervised model. 



An intuitive deterministic heuristic ("Hard closest point") 
dramatically fails for paths at low frequencies, less so for 
points. It should not be considered for sampling intervals 
larger than 30 seconds. 

A simple probabilistic heuristic ("closest point") gives 
good results for either very low frequencies (2 minutes) 
or very high frequencies (a few seconds) with more 75% 
of paths and 94% points correctly identified. However, 
the incorrect values are not as close to the true trajectory 
as they are with more accurate models ("Simple" and 
"Complex"). 

For the medium range (10 seconds to 90 seconds), trained 
models (either supervised or unsupervised) have a greatly 
improved accuracy compared to untrained models, with 
80% to 95% of the paths correctly identified by the 
former. 
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• For the paths that are incorrectly identified, trained mod- 
els ("Simple" or "Complex") provide better results com- 
pared to untrained models (the output paths are closer to 
the true paths, and the uncertainty about which paths may 
have been taken is much reduced). Furthermore, using a 
complex model ("Complex") improves these results even 
more by a factor of 13-20% on all metrics. 

• For filtering strategies: online filtering gives the worst 
results and its performance is very similar to 1 -lagged 
smoothing. The slower strategies (2-lagged smoothing 
and offline) outperform the other two by far. Two-lagged 
smoothing is nearly as good as offline smoothing, except 
in very high frequencies (less than 2 second sampling) 
for which smoothing clearly provides better results. 

• Using a trained algorithm in a purely unsupervised fash- 
ion provides an accuracy as good as when training in 
a supervised setting - within some limits and assuming 
enough data is available. The model produced by EM 
("EM-Simple") is equally good in terms of raw perfor- 
mance (path and point misses) but it may be overconfi- 
dent. 

• With more complex models, the filter can be used to 
infer some interesting patterns about the behavior of the 
drivers. 

VI. Conclusions and future work 

We have presented a novel class of algorithms to track 
moving vehicles on a road network: the path inference filter. 
This algorithm first projects the raw points onto candidate 
projections on the road network and then builds candidate 
trajectories to link these candidate points. An observation 
model and a driver model are then combined in a Conditional 
Random Field to find the most probable trajectories. 

The algorithm exhibits robustness to noise as well as to 
the peculiarities of driving in urban road networks. It is 
competitive over a wide range of sampling rates (1 seconds 
to 2 minutes) and greatly outperforms intuitive deterministic 
algorithms. Furthermore, given a set of ground truth data, 
the filter can be automatically tuned using a fast supervised 
learning procedure. Alternatively, using enough regular GPS 
data with no ground truth, it can be trained using unsupervised 
learning. Experimental results show that the unsupervised 
learning procedure compares favorably against learning from 
ground truth data. One may conclude that given enough ob- 
servations, there no need to collect expensive high-frequency 
data to train a model. 

This algorithm supports a range of trade-offs between 
accuracy, timeliness and computing needs. In its most accurate 
settings, it extends the current state of the art (431, El. This 
result is supported by the theoretical foundations of Condi- 
tional Random Fields. Because no standardized benchmark 
exists, the authors have released an open- source implementa- 
tion of the filter to foster comparison with other methodologies 
using other datasets |1|. 

In conjunction with careful engineering, this program can 
achieve high map-matching throughput. The authors have writ- 
ten an industrial- strength version in the Scala programming 



language, deployed in the Mobile Millennium system. This 
version maps GPS points at a rate of about 400 points per 
second on a single core for the San Francisco Bay area (several 
hundreds of thousands of road links), and has been scaled 
to multicore architecture to achieve an average throughput of 
several thousand points per second |20|. 

A number of extensions could be considered to the core 
framework. In particular, more detailed models of the driver 
behavior as well as algorithms for automatic feature selec- 
tion should bring additional improvements in performance. 
Another line of research is the mapping of very sparse data 
(sampling intervals longer than two minutes). Although the fil- 
ter already attempts to consider as few trajectories as possible, 
more aggressive pruning may be necessary in order to achieve 
good performance. Finally, the EM procedure presented for 
automatically tuning the algorithm requires large amounts of 
data to be effective, and could be tested on larger datasets that 
what we have presented here. 
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Generalized potential function 


(jj = (jj (q\x^ 
^ ^ \iJ J 


Observation model 


nix) 
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GPS coordinate (pair of latitude and longitude) 
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Path between one mapping x and one subsequent mapping x^ 
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Collection of all J candidate trajectories between a set of candidate statesx and a subsequent set x' 
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Probability that the vehicle is in the discrete state xj at time t given all observations 




Probability that the vehicle is in the discrete state xj at time t given all observations up to time t 
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Probability that the vehicle uses the (discrete) path p^- at time t given all observations 
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Probability that the vehicle uses the (discrete) path p^- at time t given all observations after time t-\-l 
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Number of GPS observations for a track 




Generalized feature vector 
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Partition function 
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